algorithm by biasing the structure of obtained clusters. This paper proposes a new overlapping
clustering algorithm named OCKMEx, which uses k-median to identify overlapping clusters in the
presence of outliers. This new method aims to determine the insensitivity of the OCKMEx algorithm in
locating data points that overlap even with outliers. An experimental evaluation of the algorithm was
conducted wherein synthetic datasets served as a data source, and the F1 measure criterion was applied
to assess the OCKMEx algorithm performance. Results indicate that the OCKMEx algorithm using k-median
achieved a higher accuracy rate of 100% in identifying data points that overlap even with outliers
compared to the existing k-means algorithm. The algorithm exhibited a promising performance in
identifying overlapping clusters and was resistant to outliers.

Keywords: Clustering, K-means, K-median, Outlier, Overlapping
1. INTRODUCTION

With the rise of information technology also comes the upward movement of data in several fields
[1]. Information technology and database research have served as a bridge to develop approaches to
manipulate and store data, which plays a significant role in making important decisions in an
organization or business. The procedure in which hidden knowledge is extracted from volumes of raw data
using algorithms and techniques from statistics, machine learning, and database management systems is
referred to as data mining. Data mining on extensive data supports the decisions of organizations and
firms by assembling, accumulating, and accessing corporate data [2]. In healthcare, a variety of
algorithms have been developed and used [3] to turn existing information into useful data.

Data mining is gaining popularity in disparate research fields due to its boundless applications and
approaches to mine data in an appropriate manner [4]. One such data mining technique is clustering.
Clustering refers to grouping a set of objects so that the objects in the same group are more similar
to each other in some particular manner than to those in other groups [5]. Multiple research areas
apply this technique, specifically data mining, statistical data analysis, machine learning, pattern
recognition, image analysis, and information retrieval [6]. Clustering serves as a significant area in
data mining applications and data analysis. It is a specific operation that must be a proper subset of
the other [7].

Most clustering algorithms generate exclusive clusters where each item can be part of a single
cluster only. Real-world data, such as medical datasets, inherently contain overlapping information,
which calls for overlapping clustering, a method that permits one item to be part of more than one
cluster [5]. Another essential data mining issue is outlier detection, which identifies and removes
anomalous data objects from a given data set. Outlier detection remains a research branch in data
mining that plays a significant and extensive role because of its use in a wide range of
applications [8]. An outlier is a data item whose value falls outside the bounds of the sample data
and may indicate anomalous data [6]. Moreover, it is a data item whose values vary from the rest of
the data or fall outside the described range [9]. The detection of outliers translates to information
that is significant and actionable in a wide variety of applications such as fraud detection [10],
[11], intrusion detection in cybersecurity [12], and health diagnosis [13]. In the medical world, for
example, a fluctuation from normal body temperature might signify an outlier [14]. The basic idea of
outlier detection is to find the anomalous points among the data points. Outlier detection is a
significant research problem that intends to locate data objects that are considerably different,
exceptional, and inconsistent in the database [15].

With this concern, the researchers introduce a new overlapping clustering with k-median
extension algorithm (OCKMEx). In statistics, the median is highly resistant to outliers: to pull the
median away from the bulk of the data, at least 50% of the data must be contaminated [16]. Through
its use as the primary factor in the placement of cluster centers, k-median can assimilate the
robustness that the median provides. This new method aims to determine the insensitivity of the
OCKMEx algorithm in locating data objects that overlap in a cluster under the influence of outliers.
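
The robustness of the median can be illustrated numerically. The following minimal sketch uses
Python's standard `statistics` module on a small set of hypothetical ratings (illustrative values
only, not the study's data) and compares how far the mean and the median move when a single extreme
value is appended.

```python
from statistics import mean, median

# Hypothetical ratings, chosen only for illustration (not taken from the study).
ratings = [70, 72, 75, 80, 83]
contaminated = ratings + [140]  # the same ratings with one extreme value appended

# How far each location estimate moves once the extreme value is present.
mean_shift = mean(contaminated) - mean(ratings)
median_shift = median(contaminated) - median(ratings)

print(f"mean shift:   {mean_shift:.2f}")
print(f"median shift: {median_shift:.2f}")
```

In this sketch the median moves far less than the mean, which is the property a k-median cluster
center inherits.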


2. METHOD 
2.1. Overlapping clustering with k-median extension algorithm 

This section explains how the OCKMEx algorithm detects data points that overlap within clusters.
OCKMEx is a new overlapping clustering algorithm that discovers the assignment of data points to more
than one cluster. Maximum distance (maxdist) is applied, which acts as a global threshold in assigning
objects to multiple groups. OCKMEx comprises two separate phases, as shown in Figure 1.


Input: Set of data
Output: Multi-membership of data

Phase 1
1. Input initial number of k clusters
2. Calculate the distance between data points and cluster centers
   using Manhattan distance measure
3. Assign each data point to its nearest cluster center
4. Re-calculate cluster centers
5. If not converged, repeat from step 2

Phase 2
6. Save maxdist of the data points assigned to a cluster
7. Calculate the distance of each data point to the other final
   cluster centers using Manhattan distance measure
8. Compare the distance of the data point with maxdist of the
   final center
9. If distance is less than maxdist, the data point
   has multi-membership

Figure 1. OCKMEx algorithm


PHASE 1: Segmentation of data points to form clusters is done using the k-median algorithm. The
objective of the k-median clustering algorithm is to minimize the distance between each data point and
its nearest cluster center using the 1-norm distance, which is called the Manhattan measure. First, the
algorithm accepts the initial input of k centers as the initial representation of the k points or k
medians. Then, the algorithm assigns every point to its nearest median and re-calculates each median
using the median of each feature. This process iterates until the convergence criterion is met. The
criterion function for the k-median algorithm [17] is defined in (1).
$$D(x, y) = \sum_{i} |x_i - y_i| \quad (1)$$
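
Phase 1 can be sketched in a few lines of Python. This is a minimal illustration under stated
assumptions, not the authors' implementation: the function names `manhattan` and `k_median`, the
`max_iter` cap, the tie-breaking rule (lowest cluster index wins), and the handling of empty clusters
are all choices made here for the sake of a runnable example.

```python
from statistics import median


def manhattan(x, y):
    """1-norm (Manhattan) distance between two equal-length vectors, as in (1)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def k_median(points, centers, max_iter=100):
    """Phase 1 sketch: assign points to the nearest center, then update centers.

    `centers` holds the k initial centers supplied by the user.
    Returns (final_centers, labels), where labels[i] is the cluster index of points[i].
    """
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center (ties go to the lower index).
        labels = [
            min(range(len(centers)), key=lambda j: manhattan(p, centers[j]))
            for p in points
        ]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(old)
                continue
            # Feature-wise median of the cluster members.
            new_centers.append(tuple(median(col) for col in zip(*members)))
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels
```

Calling `k_median(points, [(83, 2), (72, 6)])` on the 20 records of Table 1 leaves both centers
unchanged, so the centers reported in Table 2 are a fixed point of this update. The initial centers
the authors actually used are not stated in the text.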


PHASE 2: Maxdist [18] captures the distance between data objects and their cluster centers. It is
saved after the first actual run of k-median and serves as a global threshold in determining the
multiple memberships of data points in a cluster. A membership table is generated, including the data
object vectors designated to each cluster and their final cluster centroid. A value of 1 or 0 is
assigned to each data object in the membership table, denoting that it is a member or not a member of
a cluster. The distance between each final k center and the data points assigned to the other clusters
is computed through iteration. The calculated distance is then compared with the maxdist of that
cluster. If the distance is less than maxdist, the data object is assigned as a member of that cluster
centroid, and the membership table is updated with a value of 1 representing cluster membership.
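
The Phase 2 test can be sketched as follows. This is a hedged illustration rather than the authors'
code: the function name `find_overlaps` is hypothetical, maxdist is assumed to be tracked per cluster
(the largest Manhattan distance from a final center to a point assigned to it), and the comparison is
strict, following step 9 of Figure 1. The points, centers, and primary assignments are those listed
in Table 2.

```python
def manhattan(x, y):
    """1-norm (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))


def find_overlaps(points, centers, labels):
    """Phase 2 sketch: return a membership table with one 0/1 row per point."""
    k = len(centers)
    # Assumed per-cluster maxdist: farthest assigned point from its own center.
    maxdist = [
        max((manhattan(p, centers[j])
             for p, lab in zip(points, labels) if lab == j), default=0)
        for j in range(k)
    ]
    table = []
    for p, lab in zip(points, labels):
        row = [0] * k
        row[lab] = 1  # primary membership from Phase 1
        for j in range(k):
            # Additional membership when the point lies inside another cluster's maxdist.
            if j != lab and manhattan(p, centers[j]) < maxdist[j]:
                row[j] = 1
        table.append(row)
    return table


# Vectors, final centers, and primary assignments as listed in Table 2.
points = [(80, 2), (90, 2), (77, 3), (70, 5), (75, 3), (72, 6), (73, 7), (80, 3), (90, 2), (79, 4),
          (72, 7), (71, 6), (82, 2), (83, 2), (95, 1), (90, 1), (75, 6), (70, 8), (84, 2), (83, 3)]
centers = [(83, 2), (72, 6)]
labels = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]

table = find_overlaps(points, centers, labels)
multi = [p for p, row in zip(points, table) if sum(row) > 1]
print(multi)
```

Under these assumptions the sketch flags the vectors (75, 3) and (75, 6) as having multi-membership,
matching the two overlapping instances reported for Experiment 1.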


2.2. Accuracy 

The researchers used recall, precision, and F-measure to evaluate the accuracy of the overlapping
clustering results. Precision is the fraction of identified pairs that are correctly placed in the
same cluster, and recall is the fraction of actual pairs that were identified. The formulas for
precision and recall follow [19].


$$\text{Precision} = \frac{\text{Number of Correctly Identified Linked Pairs}}{\text{Number of Identified Linked Pairs}} \quad (2)$$

$$\text{Recall} = \frac{\text{Number of Correctly Identified Linked Pairs}}{\text{Number of True Linked Pairs}} \quad (3)$$


To combine precision and recall into a single score, the F-measure, also referred to as the F1
score, was also used. The F-measure is the weighted harmonic mean of recall and precision [20]. The
higher the value of the F-measure, the better the detection accuracy, where 0 represents the worst and
1 represents a perfect detection [21]. The calculation of the F-measure is defined in (4).


$$F_1\ \text{Score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Precision} + \text{Recall}} \quad (4)$$
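
The three measures can be computed directly from the set of identified items and the set of true
ones. The sketch below is illustrative: the function name `pair_scores` is hypothetical, and
returning 0 when a denominator is empty is an assumed convention, chosen because it agrees with the
zero scores reported when no overlap is detected.

```python
def pair_scores(identified, true_links):
    """Return (precision, recall, f1) following (2), (3), and (4)."""
    identified, true_links = set(identified), set(true_links)
    correct = len(identified & true_links)  # correctly identified linked pairs
    # Assumed convention: a score with an empty denominator is reported as 0.
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(true_links) if true_links else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = {(75, 3), (75, 6)}
print(pair_scores({(75, 3), (75, 6)}, truth))  # every true overlap found, nothing extra
print(pair_scores(set(), truth))               # no overlap detected at all
```

The first call corresponds to a perfect detection and the second to a run in which no overlapping
point is found.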


3. RESULTS AND DISCUSSION 

In this section, two experiments were conducted to test the developed OCKMEx algorithm. The first
experiment determines whether the OCKMEx algorithm can group data into clusters and locate points that
belong to multiple clusters. The second experiment examines the accuracy of the OCKMEx algorithm in
locating data points that overlap within clusters in the presence of outliers, in comparison with the
existing algorithm MCOKE. Synthetic datasets were used for the implementation of the algorithm.


3.1. Experimentation 1 

The aim of experiment 1 is to test whether the OCKMEx algorithm can group data samples into
clusters based on the similarity of their features and locate data points that have multiple
memberships. The study used a synthetic dataset with two attributes (Rating, Absences) and 20
instances, with two considered linked pairs. Table 1 shows the experimental synthetic dataset.


Table 1. Synthetic datasets
Student      Rating  Absences    Student      Rating  Absences

Student 1    80      2           Student 11   72      7
Student 2    90      2           Student 12   71      6
Student 3    77      3           Student 13   82      2
Student 4    70      5           Student 14   83      2
Student 5    75      3           Student 15   95      1
Student 6    72      6           Student 16   90      1
Student 7    73      7           Student 17   75      6
Student 8    80      3           Student 18   70      8
Student 9    90      2           Student 19   84      2
Student 10   79      4           Student 20   83      3


3.1.1. Phase 1 
In the study's first phase, the synthetic dataset was run with the k-median algorithm to segment the
dataset into clusters. A user enters the number k of initial cluster centers to which the data points
are assigned. In this experiment, OCKMEx takes an input of two cluster centers and assigns each data
point to its nearest cluster center using the Manhattan distance measure. Figure 2 shows the
simulation result of two clusters in 2-dimensional space.

Figure 2. Simulation results of two clusters


After the initial run of OCKMEx, a membership table (MT) is generated, showing the vectors of all
assigned data points and their cluster center. Each data point in the MT holds a value of 1 or 0,
which signifies its membership in a cluster: 1 means membership in that cluster, and 0 means
non-membership. Table 2 shows the MT obtained from the clusters produced using the k-median
algorithm.


Table 2. Two clusters membership table
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     83.0, 2.0        1           0
90, 2     83.0, 2.0        1           0
77, 3     83.0, 2.0        1           0
70, 5     72.0, 6.0        0           1
75, 3     72.0, 6.0        0           1
72, 6     72.0, 6.0        0           1
73, 7     72.0, 6.0        0           1
80, 3     83.0, 2.0        1           0
90, 2     83.0, 2.0        1           0
79, 4     83.0, 2.0        1           0
72, 7     72.0, 6.0        0           1
71, 6     72.0, 6.0        0           1
82, 2     83.0, 2.0        1           0
83, 2     83.0, 2.0        1           0
95, 1     83.0, 2.0        1           0
90, 1     83.0, 2.0        1           0
75, 6     72.0, 6.0        0           1
70, 8     72.0, 6.0        0           1
84, 2     83.0, 2.0        1           0
83, 3     83.0, 2.0        1           0

Total Count                12          8


3.1.2. Phase 2 

In this phase, the researchers used each cluster's maximum distance (maxdist) as a threshold in
discovering whether data points belong to multiple groups. This time, the OCKMEx algorithm iterates to
calculate the distance between the data points assigned to their primary cluster and the other cluster
center. If the calculated distance of a data point is less than maxdist, its MT entry is set to 1;
otherwise, 0. As shown in Table 3, two (2) instances overlap with another cluster.
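
As a check on these two instances, the distances can be worked out by hand from (1), using the final
centers in Table 2 and assuming that maxdist is taken per cluster as the largest distance from a
center to one of its own members. For Cluster 1, centered at (83, 2), the farthest member is (95, 1):

```latex
\begin{aligned}
\text{maxdist}_1 &= |95 - 83| + |1 - 2| = 13 \\
D\big((75, 3), (83, 2)\big) &= |75 - 83| + |3 - 2| = 9 < 13 \\
D\big((75, 6), (83, 2)\big) &= |75 - 83| + |6 - 2| = 12 < 13
\end{aligned}
```

Both points are assigned to the cluster centered at (72, 6) yet fall within the threshold of the
other cluster, which is consistent with the two overlapping instances reported. Every other member of
that cluster lies at a distance of 15 or more from (83, 2).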

3.2. Experimentation 2

This experiment aims to examine the accuracy of the OCKMEx algorithm in locating data points that
overlap within clusters in the presence of outliers, compared with the existing algorithm MCOKE [22].
The synthetic data samples from the first experiment were reused to test the performance of the two
algorithms. The dataset comprises two attributes (Rating, Absences) and 25 instances. Five outliers
are intentionally integrated into the data samples; thus, 20 instances are normal data with two linked
pairs, and five are considered anomalies in the dataset (Student 21 to Student 25). Table 4 shows the
data samples.


Table 3. OCKMEx overlapping results
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     83.0, 2.0        0           0
90, 2     83.0, 2.0        0           0
77, 3     83.0, 2.0        0           0
70, 5     72.0, 6.0        0           0
75, 3     72.0, 6.0        0           1
72, 6     72.0, 6.0        0           0
73, 7     72.0, 6.0        0           0
80, 3     83.0, 2.0        0           0
90, 2     83.0, 2.0        0           0
79, 4     83.0, 2.0        0           0
72, 7     72.0, 6.0        0           0
71, 6     72.0, 6.0        0           0
82, 2     83.0, 2.0        0           0
83, 2     83.0, 2.0        0           0
95, 1     83.0, 2.0        0           0
90, 1     83.0, 2.0        0           0
75, 6     72.0, 6.0        0           1
70, 8     72.0, 6.0        0           0
84, 2     83.0, 2.0        0           0
83, 3     83.0, 2.0        0           0

Total Overlap Count        0           2


Table 4. Synthetic dataset with outliers
Student      Rating  Absences

Student 1    80      2
Student 2    90      2
Student 3    77      3
Student 4    70      5
Student 5    75      3
Student 6    72      6
Student 7    73      7
Student 8    80      3
Student 9    90      2
Student 10   79      4
Student 11   72      7
Student 12   71      6
Student 13   82      2
Student 14   83      2
Student 15   95      1
Student 16   90      1
Student 17   75      6
Student 18   70      8
Student 19   84      2
Student 20   83      3
Student 21   138     9
Student 22   135     7
Student 23   140     8
Student 24   125     4
Student 25   127     6


In this experiment, two tests were conducted: one with the OCKMEx algorithm and the other with the
MCOKE algorithm. The two algorithms use different techniques in arranging data points to form
clusters. The OCKMEx algorithm uses k-median (minimization of the 1-norm distance), whereas k-means
(minimization of the 2-norm distance) [16] is applied in the MCOKE algorithm; both algorithms use
maxdist to detect data points that overlap within clusters. For the first experimental test, the
OCKMEx algorithm takes two k centers as input for the initial formation of clusters. Figure 3 shows
the simulated data samples with outliers plotted in 2-dimensional space. As seen in the simulated
results from Iteration 1 to Iteration 6 in Figure 4, the use of k-median makes the algorithm highly
resistant to the outliers' tendency to pull the k centers away from the normal data samples.

Iteration 1 in Figure 4(a) shows that the application of the k-median algorithm brings high immunity
to outliers, allowing the members of each cluster to be obtained. Figure 4(b) shows minimal movement
of the cluster centers in Iteration 2, where the algorithm resists the outliers and still allows the
formation of clusters. The identification of clusters is still evident in Iteration 3, even in the
presence of outliers, as shown in Figure 4(c). Iteration 4 in Figure 4(d) depicts the insensitivity of
the algorithm in locating data objects under the influence of outliers. Iteration 5 in Figure 4(e)
demonstrates a minor change in the position of the cluster centers, allowing the assignment of data
points to clusters, and Iteration 6 in Figure 4(f) shows no movement of the cluster centers,
displaying consistent resistance to outliers.

The next step is to determine the multi-membership of the data points. Tables 5 and 6 show the
membership table and the overlapping results of the first test, conducted using the OCKMEx algorithm.
The test indicates that OCKMEx detected two (2) instances in the data samples as members of two
clusters. The same data samples were then processed using MCOKE. Figure 5 shows how vulnerable MCOKE
is to the existence of outliers. Based on the simulation results from Iteration 1 to Iteration 3,
outliers can drastically pull one of the k centers away from the rest of the expected data, turning
all outliers into a cluster of their own.

Figure 3. Initial formation of clusters with outliers

Figure 4. OCKMEx simulation results with outliers for: (a) Iteration 1, (b) Iteration 2, (c) Iteration 3,
(d) Iteration 4, (e) Iteration 5, and (f) Iteration 6

Iteration 1 in Figure 5(a) shows a visible movement of the cluster center with respect to the normal
data points. Figure 5(b) shows that in Iteration 2 the cluster center is drastically drawn away from
the normal data points, displaying sensitivity to outliers. Iteration 3 in Figure 5(c) shows how
highly vulnerable MCOKE is to the presence of outliers, as a new cluster of outliers is formed with
the cluster center evidently pulled away from the normal data.


Table 5. OCKMEx MT with outliers
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     90.0, 2.0        1           0
90, 2     90.0, 2.0        1           0
77, 3     73.0, 6.0        0           1
70, 5     73.0, 6.0        0           1
75, 3     73.0, 6.0        0           1
72, 6     73.0, 6.0        0           1
73, 7     73.0, 6.0        0           1
80, 3     73.0, 6.0        0           1
90, 2     90.0, 2.0        1           0
79, 4     73.0, 6.0        0           1
72, 7     73.0, 6.0        0           1
71, 6     73.0, 6.0        0           1
82, 2     90.0, 2.0        1           0
83, 2     90.0, 2.0        1           0
95, 1     90.0, 2.0        1           0
90, 1     90.0, 2.0        1           0
75, 6     73.0, 6.0        0           1
70, 8     73.0, 6.0        0           1
84, 2     73.0, 6.0        0           1
83, 3     90.0, 2.0        1           0
138, 9    90.0, 2.0        1           0
135, 7    90.0, 2.0        1           0
140, 8    90.0, 2.0        1           0
125, 4    90.0, 2.0        1           0
127, 8    90.0, 2.0        1           0

Total Count                13          12


Table 6. OCKMEx overlapping results
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     90.0, 2.0        0           0
90, 2     90.0, 2.0        0           0
77, 3     73.0, 6.0        0           0
70, 5     73.0, 6.0        0           0
75, 3     73.0, 6.0        0           1
72, 6     73.0, 6.0        0           0
73, 7     73.0, 6.0        0           0
80, 3     73.0, 6.0        0           0
90, 2     90.0, 2.0        0           0
79, 4     73.0, 6.0        0           0
72, 7     73.0, 6.0        0           0
71, 6     73.0, 6.0        0           0
82, 2     90.0, 2.0        0           0
83, 2     90.0, 2.0        0           0
95, 1     90.0, 2.0        0           0
90, 1     90.0, 2.0        0           0
75, 6     73.0, 6.0        0           1
70, 8     73.0, 6.0        0           0
84, 2     73.0, 6.0        0           0
83, 3     90.0, 2.0        0           0
138, 9    90.0, 2.0        0           0
135, 7    90.0, 2.0        0           0
140, 8    90.0, 2.0        0           0
125, 4    90.0, 2.0        0           0
127, 8    90.0, 2.0        0           0

Total Overlap Count        0           2

Figure 5. MCOKE simulation results with outliers for: (a) Iteration 1, (b) Iteration 2, and (c) Iteration 3

Tables 7 and 8 show the membership of each data point in its clusters. Based on the membership
results, the existing MCOKE left the data points that overlap within clusters undiscovered. Various
studies have stressed that the presence of outliers in the data samples significantly affects a
model's ability to produce correct and accurate results [23]-[25]. Under the influence of the
outliers, MCOKE was unsuccessful in identifying data points with multi-membership, making the
algorithm ineffective in discovering valuable and vital data.
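
The pull that outliers exert on a mean-based center can be checked against the reported values. The
following minimal sketch assumes that the final k-means partition places the 20 normal records in one
cluster and the five outliers in the other, as the cluster centers listed in Table 7 indicate, and
takes the records from Table 4. The helper name `centroid` is hypothetical.

```python
from statistics import mean

# The 20 normal records and the five outliers (Rating, Absences), taken from Table 4.
normal = [(80, 2), (90, 2), (77, 3), (70, 5), (75, 3), (72, 6), (73, 7), (80, 3), (90, 2), (79, 4),
          (72, 7), (71, 6), (82, 2), (83, 2), (95, 1), (90, 1), (75, 6), (70, 8), (84, 2), (83, 3)]
outliers = [(138, 9), (135, 7), (140, 8), (125, 4), (127, 6)]


def centroid(points):
    """Feature-wise mean of a cluster, i.e. the k-means center update."""
    return tuple(mean(col) for col in zip(*points))


print(centroid(normal))    # center of the cluster holding all normal records
print(centroid(outliers))  # center of the cluster formed by the outliers alone
```

Under that assumption the two means agree with the MCOKE cluster centers listed in Table 7: one
center sits on the outliers, and the 20 normal records are left sharing a single center.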

The summary of the accuracy results of the two experiments performed using the synthetic data
samples is shown in Table 9. Results indicate that the OCKMEx algorithm achieved a higher accuracy
rate of 100% in identifying data points that overlap even with outliers compared to the existing
algorithm MCOKE. This indicates that the OCKMEx algorithm is resistant to outliers mixed with actual
values.


Table 7. MCOKE MT results with outliers
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     79.55, 3.75      1           0
90, 2     79.55, 3.75      1           0
77, 3     79.55, 3.75      1           0
70, 5     79.55, 3.75      1           0
75, 3     79.55, 3.75      1           0
72, 6     79.55, 3.75      1           0
73, 7     79.55, 3.75      1           0
80, 3     79.55, 3.75      1           0
90, 2     79.55, 3.75      1           0
79, 4     79.55, 3.75      1           0
72, 7     79.55, 3.75      1           0
71, 6     79.55, 3.75      1           0
82, 2     79.55, 3.75      1           0
83, 2     79.55, 3.75      1           0
95, 1     79.55, 3.75      1           0
90, 1     79.55, 3.75      1           0
75, 6     79.55, 3.75      1           0
70, 8     79.55, 3.75      1           0
84, 2     79.55, 3.75      1           0
83, 3     79.55, 3.75      1           0
138, 9    133, 6.8         0           1
135, 7    133, 6.8         0           1
140, 8    133, 6.8         0           1
125, 4    133, 6.8         0           1
127, 8    133, 6.8         0           1

Total Count                20          5


Table 8. MCOKE overlapping results
Vectors   Cluster Center   Cluster 1   Cluster 2

80, 2     79.55, 3.75      0           0
90, 2     79.55, 3.75      0           0
77, 3     79.55, 3.75      0           0
70, 5     79.55, 3.75      0           0
75, 3     79.55, 3.75      0           0
72, 6     79.55, 3.75      0           0
73, 7     79.55, 3.75      0           0
80, 3     79.55, 3.75      0           0
90, 2     79.55, 3.75      0           0
79, 4     79.55, 3.75      0           0
72, 7     79.55, 3.75      0           0
71, 6     79.55, 3.75      0           0
82, 2     79.55, 3.75      0           0
83, 2     79.55, 3.75      0           0
95, 1     79.55, 3.75      0           0
90, 1     79.55, 3.75      0           0
75, 6     79.55, 3.75      0           0
70, 8     79.55, 3.75      0           0
84, 2     79.55, 3.75      0           0
83, 3     79.55, 3.75      0           0
138, 9    133, 6.8         0           0
135, 7    133, 6.8         0           0
140, 8    133, 6.8         0           0
125, 4    133, 6.8         0           0
127, 8    133, 6.8         0           0

Total Count                0           0

Table 9. Accuracy performance results
Test          Dataset    Algorithm  Cluster  Outlier  Overlap  Precision  Recall  F1-Measure
Experiment 1  Synthetic  OCKMEx     2        0        2        1.0        1.0     1.0
Experiment 2  Synthetic  OCKMEx     2        5        2        1.0        1.0     1.0
Experiment 2  Synthetic  MCOKE      2        5        0        0          0       0


4. CONCLUSION 

In this study, we introduced a new overlapping clustering algorithm called OCKMEx. This algorithm
showed better performance than the existing algorithm MCOKE in determining overlapping clusters and
was more robust to outliers. Based on the results generated from the experiments, OCKMEx provided a
higher accuracy rate in identifying overlapping clusters even with outliers. The algorithm is a
beneficial tool for clustering data objects and identifying overlapping clusters. Even with these
promising results, additional experiments and testing are needed. In particular, a new calculation to
isolate outliers is to be considered by the researchers. Additionally, the researchers suggest that a
method capable of excluding and separating outliers be considered in a future study.